Time - 60-120 minutes
To briefly recap, last week we:
Today we are going to:
Generalised Linear Models (GLMs) are a critical and hugely useful technique for data analysis. They are Generalised Linear Models because they allow you to specify the error distribution of your model (e.g. Gaussian, Poisson, etc.) and the link function (e.g. identity, log, etc.). This means that they are hugely flexible: you can fit them to count data, binary data, or data where the predictor variables are categorical or continuous. Critically, GLMs can be fit to data where the errors are normally or non-normally distributed. Linear regressions - which I am sure you are all familiar with, or at least aware of - are a special case of the GLM, where the data have a Gaussian (normal) error distribution with an identity link (see below).
As we said above, GLMs are fabulously flexible - they can do t-test style analyses with two or more categorical predictors, run linear regressions with continuous predictor variables, and mix both categorical and continuous predictors in multiple regression analyses. We will also consider the extension to GLMs which allows us to account for non-randomness between samples (so-called mixed models). GLMs are complex beasts, but they are widely used to analyse biological data because such data are often nested and non-normal, so we are going to cover this topic in depth here rather than spread ourselves thinly across many different statistical techniques. We will touch on some other statistics in upcoming lectures.
The following two sections are a quick refresher on error distributions and link functions. Feel free to skip these if you are comfortable with the topics.
Last week we talked about normally distributed data, and we learned what that means in terms of visualizing data in a histogram (classic bell-curve). In reality, that was a slight simplification of what we mean when we talk about normally distributed data. What we are really interested in is normally distributed error.
Well, below we have a histogram of data where the mean of the data is shown with a dashed line. When we are running statistical tests we are essentially comparing the mean values of different groups (e.g. height, as we did last week), and looking at how varied the data are around those means to estimate how likely the groups are to be statistically different. This variance around the mean is our “error” - effectively, how wrong the mean value of the group estimated by the model is when compared against all of the observed data. It is this error around the mean that we want to be normally distributed (if we are fitting a model which assumes a normal distribution, like the t-test we did last week).
So, the error distribution specifies the structure of the error in the model - i.e. the variance in the data not explained by the predictor variables (e.g. sex explaining height).
To extend beyond the t-test we looked at last week, let’s take the example of a Gaussian (normal) distribution where our predictor variable isn’t categorical (sex - male or female) but rather continuous. The expectation of fitting a model with a Gaussian (normal) error distribution is that the residual error (error not explained by the model) is normally distributed around the fitted model:
Here we have a linear model (blue horizontal line in the Fitted model panel) where there is no effect of a change in the x value on the y values (i.e. the slope of the blue line is ~0). The data are plotted in light blue, and the residual error (the difference between the predicted values [the blue horizontal line] and the observed values [the light blue dots]) is shown by the grey lines. You can see that the residuals are normally distributed around that fitted blue line (see the histogram of residuals).
In effect, we are trying to minimise the residual error, as less error = a model which fits the data better. So we are looking for a roughly normal distribution of errors in our histogram of the residuals (top right panel above, ☑), and no clear trend in our fitted vs residuals plot (bottom right panel above, ☑).
If we now fit a Gaussian model to count data (e.g. data on the number of a species at a sample site):
You will see that the residuals are very much not normally distributed (because we are working with counts, which are positive integers). Consequently, fitting a Gaussian model to count data like that in the plots above isn’t appropriate.
Using an error distribution other than the Gaussian allows us to deal with data which are non-normally distributed, or which don’t conform to the assumptions of the normal distribution (e.g. are integers). In the above example we would probably use a Poisson distribution (specifically designed to cope with positive integer values).
For more information on the error families for GLMs have a look at the help - ?family.
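As a sketch of how swapping the error family looks in practice (here `site_data`, `count`, and `area` are hypothetical names, not data from this practical):

```r
## fit a Poisson GLM to (hypothetical) count data -
## the family argument specifies the error distribution
mod_pois <- glm(count ~ area,
                data = site_data,
                family = "poisson")

## the same model with an (inappropriate for counts) Gaussian family,
## for comparison of the residual diagnostics
mod_gaus <- glm(count ~ area,
                data = site_data,
                family = "gaussian")
```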
Link functions - as the name suggests - link two things within our model: the data and the error distribution. The easiest way to think of them is as a transformation applied within the model. More formally, the link function relates the mean value of y to the linear predictor (x).
Again, taking the example of the normal distribution - the default link function is an “identity” link. This is basically saying that we don’t need to do any transformation to the data - we are assuming that its error follows the normal distribution.
If we look at the histogram above then the distribution is roughly normal (if we squint a bit) but now our fitted vs residual plot shows a clear banana curve in the plotted points - we want to avoid this (remember, we are expecting there to be no trend in the fitted vs residual plot).
If we then fit this model with a log link:
You can see that it hasn’t changed our histogram of residuals much, but has changed the trend in the fitted vs residuals plot, which now looks good. It has also changed the fit of our regression line, which now curves rather than being straight.
Note - looking for a normal distribution of residuals is useful when the residuals for a distribution are EXPECTED to be normally distributed. This means that residual plots like those above are useful for some error distribution families, but not all.
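To change the link function away from a family's default, you pass it inside the family function when fitting the model. A sketch (here `y`, `x`, and `my_data` are placeholder names):

```r
## gaussian family with its default identity link
mod_identity <- glm(y ~ x,
                    data = my_data,
                    family = gaussian(link = "identity"))

## gaussian family with a log link instead
mod_log <- glm(y ~ x,
               data = my_data,
               family = gaussian(link = "log"))
```

Note that writing `family = "gaussian"` as a string uses the default identity link; writing the family as a function call lets you specify the link explicitly.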
This is quite a complex topic, and one which can be hard to get your head around. If you aren’t comfortable with these issues then I suggest you do some reading to familiarise yourself with them. Here are a few places to get started:
Today’s practical (particularly the second half) will also give us the opportunity to go through an analysis from start to finish, exactly as you would in “real life”. We are going to go through the following steps:
…
etc.
…
Keep this in mind as we work through the following examples.
Time - 60 minutes
So, let’s think about fitting a GLM to data. We are going to work on a few different data sets to show how GLMs can be used in different scenarios (and how flexible they are). First off we are going to briefly go through an example using a GLM to carry out an ANOVA style analysis (looking to see whether there are differences between the mean values of different groups in data). This is conceptually very similar to the t-test you did last week, but with 3 groups rather than comparing a pair of groups. We will then deep-dive into a more complex example later on.
For our simple first example we will analyse data which we worked with last week, the Palmer Penguins flipper lengths:
##load the package
library(palmerpenguins)
##load the penguin data set
data(penguins)

Take a look at the structure of the data. Try and understand what might be response variables and what might be predictor variables. Discuss this in your groups.
Right, importantly we need to visualize our data. This is critical to understanding the structure of our data, and thus how the data should be analysed.
Task
Plot the flipper width data grouped (split) by species so we can
eyeball whether there might be differences between the three species in
the data. In the below plot I have “jittered” the data, which means we
have artificially added variance to the data on the x-axis so we can see
the points more easily. See if you can do this in your plot (hint -
geom_ is your friend!)
## Plot the flipper length between the different species
ggplot(penguins, aes(x=species, y=flipper_length_mm)) +
geom_jitter(aes(col=species)) +
theme_bw()

Great, so it looks like there might be some differences between the
species (particularly between Gentoo and the other two
species). We want to test if these differences are down to chance
(e.g. an artifact of sampling), or whether probability suggests that
these differences are really there between the species. To do this we
want to compare the variance and means of the
flipper_length_mm between the three species.
However, as we have plotted it above it is quite hard to discern what the distributions of the data are (e.g. are they normally distributed?), so we can do a better job at visualising the data.
Histograms are particularly useful for seeing whether data are
normally distributed, and are (luckily) easy to do with
ggplot.
Task
Tweak the following code to make a plot like the one below:

ggplot(..., aes(x=..., fill = species)) +
## bin width determines how coarse the histogram is
## the alpha determines the transparency of the bars
## position allows you to determine what kind of histogram you plot (e.g. stacked vs overlapping). try changing to position="stack"
geom_histogram(binwidth = 1, alpha = .5, position="identity")

ggplot(penguins, aes(x=flipper_length_mm, fill=species)) +
geom_histogram(binwidth=1, alpha=.5, position="identity")

So from the above we can visualise:
We can also try and make the qq plots we did last week, to check whether the data grouped by species appear normally distributed.
Task
Produce the below plots, adapting the code we used to do this last week.
So we want to test whether there is any significant difference between the mean flipper lengths of the three species. This means we have a continuous response (y) variable and a categorical predictor (x) variable. This makes the coding for our GLM nice and simple. We have visually checked the data and there don’t appear to be any obvious issues with it, so we can forge ahead.
glm() is a base function in R, so we don’t need to load any packages. To run a GLM:
##fit a glm()
mod_flipper <- glm(flipper_length_mm ~ species,
##specify the data
data = penguins,
##specify the error structure
family = "gaussian")

Here we specify the formula of the model using ‘~’, where the first term is the dependent (y/response) variable, and the following term/terms is/are the independent (x/predictor) variables - i.e. the ones we believe should influence the value of y, but which aren’t influenced by the value of y. So flipper_length_mm is our dependent variable, and species is our independent variable.
Hint - you can read ~ in R as “as a function of”. So in this case, we are testing whether flipper_length_mm changes as a function of species.
We have also specified the “family” to be “gaussian” (we didn’t actually need to do this, as it is the default setting, but it’s good practice to make sure we understand exactly what we are doing). Family here means the error structure, so we are assuming (initially) that our residuals are normally distributed. This means that - given our current settings - we are actually just fitting an ANOVA* to the data.
*an ANOVA is like a t-test but with more than 2 groups.
R will make some assumptions about what kind of predictor variables you want to use based on the type of data the predictor variables are stored as in the data frame. For example, if the predictor variable is numeric then R will assume it is a continuous predictor variable, i.e. one which could take any value (whole or not). If the data is a character or logical vector, then R will assume it is a factorial predictor variable, and that it can only take the values observed in the vector.
Why does this matter? Well, we can use a model fit in R that has a continuous variable to predict what the response variable should be at values of the continuous variable we haven’t yet seen. For example, in the penguin data there are data on the bill length and bill depth. The longest measured bill length is 59.6mm, and this corresponds to a bill depth of 17mm. We might want to know what we would expect the bill depth to be when the length is 65mm. We haven’t observed this, but we could predict it from a model we have fit to these two variables (assuming there is a correlation between the two!).
We can’t do this for factorial predictors, as we have no expectation of what a new factorial value would do to the relationship (e.g. we can’t predict the mean flipper length for a new species, as we have no idea what that looks like).
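As a sketch of how such a prediction works (this assumes a model of bill depth as a function of bill length, which is not one of the models we fit above):

```r
## load the penguin data
library(palmerpenguins)
data(penguins)

## fit a (hypothetical) model of bill depth as a function of bill length
mod_bill <- glm(bill_depth_mm ~ bill_length_mm,
                data = penguins,
                family = "gaussian")

## predict the expected bill depth at an unobserved bill length of 65mm
predict(mod_bill, newdata = data.frame(bill_length_mm = 65))
```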
We can force R to use numeric predictors as a factorial predictor in our model (which we might want to do if something has been coded numerically but isn’t really continuous) by:
x ~ as.factor(y)

Statistics is as much art as science, and assessing how well a model fits data can be complex and time consuming. There isn’t an automated or best approach for doing it.
Remember:
Luckily R has some nifty built-in functions for helping us. We will start off using some of the base methods to do this, and then move on to some more complex ways later today and next week.
Once we have coded and run the model we have an object (above called
mod_flipper):
##display the class of the model object
class(mod_flipper)

## [1] "glm" "lm"
You can see this is a special class, a GLM and LM (LM just means linear model). There are some basic functions for assessing the fits of GLMs in R (try running plot(mod_flipper)), but - once again - packages can come to our rescue here, specifically the excellent performance package.
Task
Install the package and run the check_model function.
See here for
vignette.
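If you haven’t used the package before, the workflow is roughly as follows (the install line only needs running once):

```r
## install (once) and load the performance package
# install.packages("performance")
library(performance)

## run the diagnostic plots on our fitted model
check_model(mod_flipper)
```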
check_model() helpfully tells us how we should interpret the output above - and to me (at least) this all looks fine.
Note - For a more in depth discussion of how to interpret the above diagnostic plots see the excellent blog post here.
We will go more into assessing the fit of a model later in this session and next week, but for now it looks like this simple model fits the data well, and we can be pretty confident in it. In fact, this is about as good a fit as you are going to get when fitting a model to real data!
So we have fit our model, and ensured that our model fits the data well. Now we want to look at what our model says about the data.
First off let’s look at the model output, using the
summary() function:
##summarise the model outputs
summary(mod_flipper)

##
## Call:
## glm(formula = flipper_length_mm ~ species, family = "gaussian",
## data = penguins)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17.9536 -4.8235 0.0464 4.8130 20.0464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.9536 0.5405 351.454 < 2e-16 ***
## speciesChinstrap 5.8699 0.9699 6.052 3.79e-09 ***
## speciesGentoo 27.2333 0.8067 33.760 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 44.1099)
##
## Null deviance: 67427 on 341 degrees of freedom
## Residual deviance: 14953 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 2270.6
##
## Number of Fisher Scoring iterations: 2
You can see this gives us further details on the model. The important bits for us at this stage are the coefficients.
Note - for a full discussion of the output see the excellent posts here and here.
We can see there is a significant positive effect of both
speciesChinstrap and speciesGentoo. But what
is this measured against? And where is the third species,
Adelie?! Well, R takes the first (alphabetically in our
case as these are categorical predictors) value of our predictor
variable and uses this as our comparison point. This is called the
intercept in the model output.
So, to read our output table:
The estimate value for the intercept is the estimate of the mean value of flipper_length_mm for our intercept species (Adelie) - 189.9536. If we look at our first plot (and here I’ve added a crossed circle at 189.95):
We can see that that looks about right.
Our values for our other two species are relative to this mean. So to
calculate the mean value for Chinstrap we need to add its
Estimate (from the above table) to the
Intercept:
189.9536 + 5.8699

## [1] 195.8235
And the same for Gentoo:
189.9536 + 27.2333

## [1] 217.1869
Again, eyeballing the graph above corroborates these estimated mean values for each species. Great!
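Rather than doing this arithmetic by hand, we can get R to do it for us. A sketch using `coef()` to extract the estimates, and `predict()` as a cross-check:

```r
## extract the model coefficients as a named vector
cf <- coef(mod_flipper)

## mean flipper length for Chinstrap = intercept + Chinstrap estimate
cf["(Intercept)"] + cf["speciesChinstrap"]

## or predict the mean value for each species directly
predict(mod_flipper,
        newdata = data.frame(species = c("Adelie", "Chinstrap", "Gentoo")))
```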
We can also see there is a significant difference in all three values
in our table (all values in the Pr(>|t|) column are
<0.05 - i.e. there is a less than 5% probability this pattern
occurred by chance).
For the Intercept value, we interpret this as the intercept
Estimate being significantly different from 0 (not very
important in our case).
For the other two values, our interpretation is that they are
significantly different from our intercept (the mean value for
Adelie).
Question
Is the mean value of Chinstrap significantly different
from the mean value for Gentoo?
The simple answer to the above question is…we don’t know. Or at
least, we can’t tell from this analysis. All we are doing here is
testing for differences from the intercept (Adelie). To
discern whether there are differences between the other species we need
to do some further analysis.
To find out whether there are significant differences between our groups we need to do a multiple comparisons test (i.e. compare the differences between all of the categories in our predictor variable). Luckily, this is easy to do in R:
## load the multcomp pack
library(multcomp)
## run the multiple comparisons, and look at the summary output:
summary(glht(mod_flipper, mcp(species="Tukey")))

##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: glm(formula = flipper_length_mm ~ species, family = "gaussian",
## data = penguins)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## Chinstrap - Adelie == 0 5.8699 0.9699 6.052 2.75e-09 ***
## Gentoo - Adelie == 0 27.2333 0.8067 33.760 < 1e-09 ***
## Gentoo - Chinstrap == 0 21.3635 1.0036 21.286 < 1e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
Question
Using your newly acquired knowledge of how to read model summaries in R: which of the species are significantly different from one another?
Ok, so you’ve done a simple GLM. Now it’s time to get a bit more complicated, and to do this we are going to generate a data set (as a class) and then analyse it together.
Time - 20 minutes
I will give an introductory talk on this section so we are all up to speed.
Once you have had this talk you have 20 minutes to go away and create your data, and add it to the spreadsheet.
Each group needs to make a spreadsheet with the following columns (and in this order! so we can combine them all together).
| Group name | Thrower | Attempt | Species | Bucket | Distance | Success |
|---|---|---|---|---|---|---|
| Something funny | Chris Clements | 1 | Blue | 1 | 1.5 | 0 |
| Something funny | Chris Clements | 2 | Red | 1 | 5 | 1 |
| Something funny | Chris Clements | 3 | Green | 5 | 0.5 | 1 |
Things to consider:
To make things as simple as possible please ensure all the ball colours are spelled correctly, and have no capitals or spaces! i.e.: “lightpink” not “Light Pink” etc.
Keep your group name and your name consistent (copy and paste it into new cells). Put your first AND last name into the thrower column.
For bucket volumes enter either 1 (for the small) or 5 (for the large). Nothing else!
For distances enter the distance in meters (so 50cm is entered as 0.5). No need to specify units (i.e. don’t put “.50m”)
Success can be 0 or 1 (it either stayed in the bucket or did not), nothing else.
Use this template to save your data:
Download Blank data table.xlsx

I’ve talked over the plan, but in case you need to check anything:
the sample() function in R

Once you finish the data collection, spend some time thinking about the following questions:
We will discuss these in class.
Time - 85 minutes
In your pre-session reading we covered a workflow for data analysis:
…
etc.
…
We are now going to do this with the data we have just generated. Below under each of the headings are some tips/hints on functions/code which might be useful for each of the above stages.
Task
As a group, work through each of the sections below using all the data we have just collected across the groups. Write your code in an accessible way so that you all understand what you have done and why.
vroom()

str(), head(), tail()

summarise(), unique(), length(), group_by() might all be useful

The fitdistrplus package can help you here. It allows
you to plot your data against various distributions (you can specify) to
see whether your data approximately fit the expected distribution:
##load the fitdistrplus library
library('fitdistrplus')
##plot the fit of the observed values against a specified distribution
##what might the distribution be for these data?
plot(fitdist(observed_values,"?"))

ggplot(), geom_point(), stat_smooth(), facet_wrap()

metabolic_rate ~ temperature

glm()

## an example glm with an interaction and two predictor variables
## note - both of the terms in the interaction are included separately too
mod1 <- glm(metabolic_rate ~ temperature + treatment + temperature:treatment,
data = example_data,
family = "gaussian")

We did a brief assessment of the model fit in the first example, but in reality we are going to want to spend more time and energy digging to find out how well our model really does fit the data. This is especially important as the models become more complex.
Luckily, the performance package massively
helps with this, as - depending on what model you have fit
(e.g. Gaussian, etc) - it will update the diagnostic tools to help you
understand how well the model is doing. What do these tests show? Can
you come up with solutions to these issues?
ONE THING TO NOTE - you are never going to get a model which perfectly fits your data! (hence the “all models are wrong…”). We need to get a model which fits as best we can, so that any interpretations of the data are (with the highest likelihood) correct.
Below are some sample code chunks to help us assess our model fits
A really simple first-pass way of assessing how well your model fits your data is simply to plot the predicted values against the observed values. If your model fits well then the relationship between the observed and predicted values should be approximately one to one.
##visualize the fitted vs observed values
fit_data <- data.frame("predicted"= your_model$fitted.values,
"observed" = your_data$response_variable)
##plot them
ggplot(fit_data, aes(x=observed, y=predicted)) +
geom_point() +
##add a 1:1 line
geom_abline(intercept = 0) +
##add a linear regression to your data
geom_smooth(method="lm",
col="lightblue",
se = F) +
##set the theme
theme_classic()

Question
So, how well does your model fit your data?
The summary of your model can tell us something about how our model fits.
Task
Use Google to try and understand the difference between Null Deviance and Residual Deviance, and then look at the summary of your model and apply what you have just learned.
summary(your_model)

There are also a suite of more advanced techniques, which include simulating data based on your model fits - we will cover these next week.
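One rough rule of thumb you may come across (a sketch, not something the summary reports directly) is to compare the two deviances as a pseudo-R-squared - the proportion of the null deviance explained by the model:

```r
## proportion of deviance explained (a rough pseudo-R-squared)
## 1 - residual deviance / null deviance
1 - (your_model$deviance / your_model$null.deviance)
```

For example, plugging in the deviances from the flipper model in the first example (14953 and 67427) gives roughly 0.78.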
So, the next thing to do is fit some alternative models to the data to see if they do a better job. We might explore this even if the model looks like a pretty good fit (remember we are trying to minimise the residual error in the above plots - a model with less residual error fits our data better).
There might be two reasons for doing this:
Question
Should we consider reason (2) for our current analysis?
In the case of reason (1) the data we are working with are counts (the number of species at each distance and bucket size), so what options do we have other than the Poisson distribution?
Task
Google other options for analysing count data with a
glm() in R.
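Two options you are likely to find in your searching (both sketches - `your_data`, `count`, and `distance` are placeholder names for whatever your group's data look like):

```r
## option 1: quasipoisson - relaxes the Poisson assumption
## that the variance equals the mean (useful for overdispersed counts)
mod_quasi <- glm(count ~ distance,
                 data = your_data,
                 family = "quasipoisson")

## option 2: a negative binomial GLM, via the MASS package
library(MASS)
mod_nb <- glm.nb(count ~ distance,
                 data = your_data)
```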
We aren’t going to fit more models to this data today, as we don’t have time, but we will come back to this next week and think about alternative models, how to fit them, and how to select between them.
…
etc.
…
…
etc.
…
So, we have a final model. Now we need to interpret it! To do this we
need to look at what the model output is - using the
summary() function:
##summarise the model outputs
summary(best_mod)

Question
Discuss among your group - can you work out what the summary table is telling you? How does this relate to your null hypotheses? Do your predictor variables have a significant effect?
We may want to simply report the summary statistics from the model (see Interpreting your model), however it is often more powerful (remember - we are visual creatures) to plot the output of your model to visualise what the statistics are telling us.
Luckily (as with most things in R) someone has developed a couple of excellent packages to help us with this. Install the following packages (you will likely need to also install some additional packages, read the error messages!):
library(jtools)
library(interactions)

Then try and get to grips with the following functions to visualise your models:
plot_summs()
interact_plot()

Question
What do each of these functions allow you to do? Are both useful? Which output is easier to interpret?
Hint - try these excellent help pages on the packages here and here.
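As a starting point, a sketch of how the two functions are called (the variable names `temperature` and `treatment` here are placeholders for your own predictors):

```r
## load the two plotting packages
library(jtools)
library(interactions)

## plot the model coefficients with their confidence intervals
plot_summs(your_model)

## plot an interaction between a continuous predictor (pred)
## and a moderating predictor (modx)
interact_plot(your_model, pred = temperature, modx = treatment)
```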
So, we have some results, but what do they mean in the context of our hypotheses and the data we collected? Remember why we do analyses - to test hypotheses, or to look for patterns in data, but most of all to answer a question (be that whether there is higher mortality in people who go sky diving than in those who go scuba diving, or whether there are more species on closer, larger islands). Consequently, without relating our results and conclusions back to the questions we are asking, and putting them in the context of current knowledge in the field, our findings become meaningless.
Finally, lets spend a bit of time thinking about the following statement, and how it might relate to our model:
Question
What have we missed?
Consider things we might have failed to account for in our model (e.g. what processes driving the structure of the data have we not explicitly accounted for?). Is it feasible to incorporate these things into the analysis? If yes, should we do this? If not, does this mean we don’t trust our model output?
So, now you know how to fit a simple GLM, and a more complex GLM with multiple predictors. You know how to assess the fit of this model (in a basic way) and visualise the output. Next, we will work on a single population time series from the species data you plotted for homework last week, and see whether there has been a significant increase, decrease, or no change in that population over time.
We will go through the process of fitting a GLM to the data, checking the fit and other potential models, and then think about plotting the outputs of the model.
Let’s quickly summarise what we have learned today:
That’s a lot to cover in one week. Don’t worry if you didn’t manage to work through everything or some parts you find difficult - you can always come back to this and refer to it at any time.
| Function/operator | Call |
|---|---|
| Run a GLM model | glm() |
| Return predicted values from a model | predict() |
| Return the residuals of a glm | resid() |
| Compare models using AICc | AICc() |
| Summarise a model | summary() |
| Order the items in a column | order() |
There isn’t a right or wrong way to do last week’s homework, but you will have got feedback on some of the good and less good points from your Rmarkdown.
##Load in the penguin data
library(palmerpenguins)
##load the tidyverse
library(tidyverse)
##load rstatix (for t_test) and ggpubr (for ggqqplot and theme_classic2)
library(rstatix)
library(ggpubr)
##load the penguin data
data(penguins)
##look at the data
head(penguins)
str(penguins)
##filter for gentoo and adelie
G_and_A <- penguins %>%
filter(species %in% c("Gentoo", "Adelie"))
##plot the data
ggplot(G_and_A, aes(x=species, y=flipper_length_mm, col=species)) +
geom_boxplot() +
geom_jitter(width=0.15) +
theme_classic2()
##check for normality
ggqqplot(G_and_A, x="flipper_length_mm", color="species")
##and histogram
ggplot(G_and_A, aes(x=flipper_length_mm, fill=species)) +
geom_histogram(alpha=0.6, position = 'identity') +
theme_classic2()
##carry out t-test
pengs_T <- G_and_A %>%
##reset levels of the factor
mutate(species=as.character(species)) %>%
##drop any NA values
drop_na() %>%
##also tell it to return a more detailed output
t_test(flipper_length_mm ~ species, detailed=T) %>%
##return the significances too
add_significance()
##look at the results
pengs_T

So we have learned how to fit a GLM today, and how to plot the output. Let’s apply this knowledge and extend our penguin analysis from earlier. So, for homework, write a short Rmarkdown document (submit as an html) to: